Abstract
This work presents an exploratory data analysis of the current and predicted trajectories of global population growth. The growth, or lack thereof, of a population is a grossly multifaceted equation whose trajectory must be sociologically explored on an individual, regional level. The following will be an exploration of the cross-sectional trends as predicted by the UN’s 2024 population data set.
#read file
library(readxl)
file <- "data/WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx"
#correct column types string
wgdic_xlsx_coltypes= c('numeric', 'text', 'text', 'text', 'numeric', 'text',
'text', 'numeric', 'text', 'numeric', 'numeric', 'numeric',
'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
'numeric', 'numeric', 'numeric')
estimates_df <- read_excel(file, sheet = "Estimates", skip = 16,
col_types = wgdic_xlsx_coltypes, na = c('...',''))
#found '...' and '' for NA values throughout
# Read the data from the 'Medium variant' tab
medium_variant_df <- read_excel(file, sheet = "Medium variant", skip = 16,
col_types = wgdic_xlsx_coltypes, na = c('...',''))
#same for Second sheet
Data Refinement
As the demographic data from the United Nations is substantial in size and includes numerous metadata columns and special values not essential to the present analysis, it was important to perform targeted data refinement to ensure the dataset was both compact and analytically usable. The primary objective was to retain only meaningful, interpretable data, eliminate redundancy, and systematically address missing or malformed entries.
The first step in this process was to standardize and simplify the dataset by renaming columns with concise, informative names and dropping irrelevant columns such as “Variant” and “Notes” which served no statistical or interpretive purpose. This renaming created a cleaner working environment and made variable tracking more intuitive across subsequent steps.
To proactively identify problematic fields and inconsistencies, I performed a factor-level frequency analysis across all covariates. This involved reshaping the dataset to long format and reviewing the least common values in each column. This method surfaced structural issues such as label rows coded as data entries (e.g., “Label/Separator” under the type column) and missing time entries (year == NA). These edge cases were manually flagged and removed to prevent skewing aggregate metrics or introducing type errors during summarization.
Next, I assessed missingness across all covariates using a column-wise NA count. Based on this summary, I filtered out regions and entity types that were either irrelevant to this analysis (such as “Oceania”) or were outside the desired scope (subregions and countries, retaining only high-level “World” and “Region” entries).
After this extensive refinement, I exported the resulting datasets to Excel files to preserve and benchmark the cleaned versions. The final files were significantly reduced in size (approximately 190KB each), indicating both efficient compression and successful filtering of extraneous records and variables. This structured pipeline resulted in datasets optimized for focused, reliable exploratory and statistical analysis, while still retaining the full breadth of relevant demographic information. The data refinement process code is given below:
#First im going to fix the column names and select useful columns only
correct_names = c(
'id', 'area', 'Location_id', 'ISO3', 'ISO2', 'SDMX', 'type', 'parent_id', 'year',
'Jan_pop', 'July_pop', 'male_july_pop', 'female_july_pop', 'pop_dens_july', 'sex_ratio_july',
'med_age_july', 'nat_change', 'rate_nat_change', 'pop_change', 'pop_grow_rate', 'pop_doubl_time',
'births', 'teen_births', 'birth_rate', 'fertility_rate', 'reproduction_rate', 'mean_birthing_age',
'birth_sex_ratio', 'deaths', 'male_deaths', 'female_deaths', 'death_rate',
'newborn_life_expectancy', 'male_newborn_life_expectancy', 'female_newborn_life_expectancy',
'age15_life_expectancy', 'male_age15_life_expectancy', 'female_age15_life_expectancy',
'age65_life_expectancy', 'male_age65_life_expectancy', 'female_age65_life_expectancy',
'age80_life_expectancy', 'male_age80_life_expectancy', 'female_age80_life_expectancy',
'infant_deaths', 'infant_mortality_rate', 'live_births', 'under5_deaths', 'under5_mortality_rate',
'under40_mortality_rate', 'male_under40_mortality_rate', 'female_under40_mortality_rate',
'under60_mortality_rate', 'male_under60_mortality_rate', 'female_under60_mortality_rate',
'between15_50_mortality_rate', 'male_between15_50_mortality_rate', 'female_between15_50_mortality_rate',
'between15_60_mortality_rate', 'male_between15_60_mortality_rate', 'female_between15_60_mortality_rate',
'num_migrants', 'migration_rate'
)
#Get rid of a few more columns and add correct names
estimates_df <- estimates_df |>
select(-Variant, -Notes)
colnames(estimates_df) <- correct_names
medium_variant_df<- medium_variant_df |>
select(-Variant, -Notes)
colnames(medium_variant_df) <- correct_names
#I want to find the least common factors for each covariate and then manually
# parse for weird one-offs
# estimates_df |>
# mutate(across(everything(), as.character)) |>
# pivot_longer(everything(), names_to = "covariate", values_to = "value") |>
# group_by(covariate, value) |>
# summarise(NA_count = n(), .groups = "drop") |>
# arrange(covariate, NA_count) |>
# group_by(covariate) |>
# slice_head(n = 5) |>
# ungroup()
# medium_variant_df |>
# mutate(across(everything(), as.character)) |>
# pivot_longer(everything(), names_to = "covariate", values_to = "value") |>
# group_by(covariate, value) |>
# summarise(NA_count = n(), .groups = "drop") |>
# arrange(covariate, NA_count) |>
# group_by(covariate) |>
# slice_head(n = 5) |>
# ungroup()
#Identify Issues illuminated:
#"Region, subregion, country or area *"
#"Type: "Label/Separator"
#"Year: NA"
#Here i will deal with the above issues
estimates_df <- estimates_df |>
filter(type != "Label/Separator" | !is.na(year))
medium_variant_df <- medium_variant_df |>
filter(type != "Label/Separator" | !is.na(year))
#I want to check the amount of NA values for each covarriate and deal with
# them as needed
#Check how many NAs in each column:
# estimates_df |> summarise(across(everything(), ~ sum(is.na(.)),
# .names = "NA_count_{col}"))
#Filter out regions you dont need
estimates_df<- estimates_df |> filter(area != 'Oceania')
medium_variant_df<- medium_variant_df |> filter(area != 'Oceania')
#Filter out types you dont need
estimates_df<- estimates_df |> filter(type=='World' | type == 'Region')
medium_variant_df<- medium_variant_df |> filter(type=='World' | type == 'Region')
# estimates_df |> group_by(type)|> summarise(n())
#Write out to xlxs file
library(openxlsx)
write.xlsx(estimates_df, file = "data/estimates.xlsx")
write.xlsx(medium_variant_df, file= "data/variant.xlsx")
#both are about 190KB now!
Data set Overview
The United Nations Population Division publishes its world population predictions every two year in their World Population Prospects (WPP). Based on their predictive equation, we can partition the data by continent to gain a better insight into continental trends that comprise to the predicted global trend. This glimpse will direct further analyses explored in this report.
#what to do:
# use variant data
# facet wrap based on the area
# first check to see if there is a ggplot setting that allows for billions
# to be represented as 1.5B and millions to be represented as 760M...if not
# change population numbers to billions for world, africa, and asia
# change population to millions for Europe, NA, LA
# Style:
# blue line
# No vertical hashes, only horizontal
# x scales by 20 with 2100 as last and 2024 as first (2040 as second)
# facet_wrap for 6 graphs
# Show no points, just a smooth line
#Transform data by selecting needed covariates and combining pop data into 1 col
var_graph_df <- medium_variant_df |>
select(area, year, Jan_pop, July_pop) |>
pivot_longer(cols = c(Jan_pop, July_pop), names_to = "pop_type", values_to = "new_pop") |>
mutate(
year = if_else(pop_type == "July_pop", year + 0.5, year),
scale_factor = case_when(
area %in% c("World", "Africa", "Asia") ~ 1e6, # to represent in billions
area %in% c("Europe", "Northern America", "Latin America and the Caribbean") ~ 1e3,
TRUE ~ 1
),
unit_label = case_when(
area %in% c("World", "Africa", "Asia") ~ "(Billions)",
area %in% c("Europe", "Northern America", "Latin America and the Caribbean") ~ "(Millions)",
TRUE ~ ""
),
new_pop = new_pop / scale_factor,
#get graph order correct
area_unit = factor(
paste(area, unit_label),
levels = c(
"World (Billions)",
"Africa (Billions)",
"Asia (Billions)",
"Europe (Millions)",
"Northern America (Millions)",
"Latin America and the Caribbean (Millions)"
) # Explicitly order facets
)
)
# Create the plot
ggplot(var_graph_df, aes(x = year, y = new_pop)) +
geom_line(color = "blue", linewidth = 1) +
geom_smooth(se = FALSE, method = "loess") + #suggested by debugger
facet_wrap(~area_unit, scales = "free_y", ncol = 3, labeller = label_wrap_gen(width = 20)) +
scale_y_continuous(labels = scales::label_number()) +
scale_x_continuous(
breaks = c(2024, 2040, 2060, 2080, 2100), # Specify x-axis breaks
limits = c(2024, 2100) # Set x-axis limits
) +
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "gray"),
strip.text = element_text(size = 12, face = "bold"),
strip.text.x = element_text(margin = margin(b = 10)),
plot.margin = margin(20, 20, 20, 20)
) +
labs(
title = "Population Trends by Region",
x = "Year",
y = "Population"
)
Further Analyses
To begin this exploratory analysis, I focused on understanding how the global population distribution has shifted over time by visualizing changes in regional and sub-regional proportions. The animated nested pie chart below presents a dynamic view of population size, with outer slices representing sub-regions (such as Western Africa or Eastern Asia) and inner slices showing their broader regional groupings (like Africa or Asia). This visualization allows us to immediately identify key global shifts: most notably, the increasing share of the world population concentrated in African sub-regions such as Western and Eastern Africa, and the declining proportion in parts of Asia, particularly Eastern Asia. By framing the data in this layered structure, the goal is to uncover which areas are driving global demographic transitions and to establish a foundation for deeper investigations.
#pi charts of population proportions through years
#regrab data
estimates_df <- read_excel(file, sheet = "Estimates", skip = 16,
col_types = wgdic_xlsx_coltypes, na = c('...',''))
medium_variant_df <- read_excel(file, sheet = "Medium variant", skip = 16,
col_types = wgdic_xlsx_coltypes, na = c('...',''))
estimates_df <- estimates_df |>
select(-Variant, -Notes)
colnames(estimates_df) <- correct_names
medium_variant_df<- medium_variant_df |>
select(-Variant, -Notes)
colnames(medium_variant_df) <- correct_names
estimates_df <- estimates_df |>
filter(type != "Label/Separator" | !is.na(year))
medium_variant_df <- medium_variant_df |>
filter(type != "Label/Separator" | !is.na(year))
#get data section of data wanted for this graph
pop_prop_df <- medium_variant_df|> select(area, type, year, Jan_pop) |>
filter(type %in% c("Region","Subregion") |
(area %in% c('United States of America', 'Canada') ))
#check the overlaps:
#pop_prop_df |> group_by(type,area) |> distinct(type,area)
#found that I also want US and canada if i want any subregion for northern america
#added line to data retrieval line and here ill classify them as subregions
#classify us and canada as subregions
pop_prop_df <- pop_prop_df |> mutate(type = ifelse(
type== "Country/Area", "Subregion", type))
#create function to classify subregions as regions
sub_r <- c('Eastern Africa', 'Middle Africa', 'Northern Africa', 'Southern Africa',
'Western Africa','Central Asia', 'Eastern Asia', 'Southern Asia',
'South-Eastern Asia', 'Western Asia', 'Eastern Europe', 'Northern Europe',
'Southern Europe', 'Western Europe', 'Caribbean', 'Central America',
'South America', 'Northern America', 'United States of America', 'Canada',
'Micronesia', 'Australia/New Zealand', 'Melanesia', 'Polynesia')
sub_2_reg <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,5,5,6,6,6,6)
reg <- c('Africa', 'Asia', 'Europe', 'Latin America and the Caribbean',
'Northern America', 'Oceania')
what_region <- function(subregion){
return(reg[sub_2_reg[match(subregion, sub_r)]])
}
#fix data for regions and subregions only using above classifications
sub_and_reg <-pop_prop_df |> filter(type == 'Subregion') |>
mutate(Region = what_region(area), Subregion= area) |>
select(year, Region, Subregion, Jan_pop)
#plot
library(plotly)
region_data <- sub_and_reg |>
group_by(year, Region) |>
summarize(total_pop = sum(Jan_pop), .groups = "drop")
plot_ly() |>
# Outer donut (SubRegion-level)
add_pie(data = sub_and_reg,
labels = ~Subregion,
values = ~Jan_pop,
frame = ~year,
hole = 0.6,
textinfo = 'label',
textposition = 'inside',
insidetextfont = list(color = '#FFFFFF'),
marker = list(line = list(color = '#FFFFFF', width = 1)),
direction = 'clockwise',
sort = FALSE,
text = ~paste(Region, Jan_pop),
hoverinfo = 'text+percent') |>
# Inner pie (Region-level)
add_pie(data = region_data,
labels = ~Region,
values = ~total_pop,
frame = ~year,
textinfo = 'label',
textposition = 'inside',
direction = 'clockwise',
sort = FALSE,
name = "Region Data",
marker = list(line = list(color = '#FFFFFF', width = 1)),
# Domain places this pie inside the donut hole
domain = list(x = c(0.2, 0.8), y = c(0.2, 0.8)),
hoverinfo = 'text+percent') |>
config( displaylogo = FALSE) |>
layout(title = "Population Over Time by Region and Subregion",
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
autosize = FALSE,
updatemenus = list(
list(
type = "buttons",
showactive = FALSE,
buttons = list(
list(label = "Play",
method = "animate",
args = list(NULL,
list(mode = "immediate",
frame = list(duration = 500, redraw = FALSE),
transition = list(duration = 0)))))
)
)) |>
animation_opts(
frame = 500, # Duration per frame (ms)
transition = 0, # No transition between frames
easing = "linear"
)
The chart shows Africa, especially Western and Eastern Africa, making up more of the world’s population over time. Meanwhile, Eastern Asia’s share is shrinking. This shift points to big changes in where most people will live in the future.
The increase in Africa’s population share, as seen in the previous chart, raises questions about the underlying drivers of this growth. One likely factor is the region’s population growth rate, which may be outpacing that of other regions. To investigate this, the following animated bar chart examines how population growth rates have changed over time across various sub-regions. This visualization aims to highlight regional differences in growth patterns and identify which areas contribute most to global demographic shifts. It also draws attention to exceptions, such as Oceania and parts of Europe, which appear to deviate from the broader trend of declining growth. In contrast, Eastern Asia’s rapidly decreasing growth rate may help explain its shrinking share of the global population.
#Slinding bar charts of pop_grow_rate for subregions (and us/canada)
#get data we want for this graph
pop_growth_df <- medium_variant_df |>
select(area, type, year, pop_grow_rate) |>
filter(type %in% c("Region", "Subregion") |
(area %in% c("United States of America", "Canada"))) |>
mutate(type = ifelse(type == "Country/Area", "Subregion", type))
pop_growth_data <- pop_growth_df |>
filter(type == "Subregion") |>
mutate(Region = what_region(area), Subregion = area) |>
select(year, Region, Subregion, pop_grow_rate)
pop_growth_data <- pop_growth_data |>
arrange(Region, Subregion)
# Update Subregion to group regions for x-axis
pop_growth_data$Subregion <- factor(pop_growth_data$Subregion,
levels = unique(pop_growth_data$Subregion))
# Plot
plot_ly(data = pop_growth_data,
x = ~Subregion,
y = ~pop_grow_rate,
color = ~Region,
frame = ~year,
type = "bar",
text = ~paste(Subregion, "<br>Growth Rate:", round(pop_grow_rate, 2)),
hoverinfo = "text") |>
layout(
title = "Population Growth Rate by Subregion (Grouped by Region)",
xaxis = list(
title = "Subregion",
categoryorder = "array", # Maintain the factor order of Subregion
categoryarray = levels(pop_growth_data$Subregion), # factorlevels as order
tickangle = -45,
showgrid = FALSE
),
yaxis = list(
title = "Population Growth Rate",
showgrid = TRUE,
zeroline = TRUE
),
updatemenus = list(
list(
type = "buttons",
showactive = FALSE,
buttons = list(
list(label = "Play",
method = "animate",
args = list(NULL,
list(mode = "immediate",
frame = list(duration = 500, redraw = FALSE),
transition = list(duration = 0)))),
list(label = "Pause",
method = "animate",
args = list(NULL,
list(mode = "immediate",
frame = list(duration = 0, redraw = FALSE),
transition = list(duration = 0))))
)
)
)
) |>
animation_opts(
frame = 500, # Duration per frame (ms)
transition = 0, # No transition between frames
easing = "linear"
) |>
config(displaylogo = FALSE)
The graph shows that Africa currently has the highest population growth rates, especially in its sub-regions, which helps explain its rising global population share. However, even in Africa, growth rates are generally declining over time. Oceania and some parts of Europe show much smaller changes, suggesting more stability. Eastern Asia stands out for its sharp decline, which lines up with the shrinking trend seen in the earlier pie chart. These patterns suggest that while all regions are experiencing slowing growth, the rate and timing of that slowdown vary widely.
It logically follows that the population growth is directly related to the reproduction rate and the fertility rate. To investigate these factors by country, we will set out to identify those countries with the most drastic change in reproduction/fertility rates. Using the United States as a baseline, we find the countries with the biggest differences in reproduction and fertility rates with respect to the United States’: these countries were observed to be Central African Republic, with the largest decrease in these rates, and the Republic of Korea, with the largest increase in these rates. The following graphs aim to show the drastic differences in reproduction trajectories over the next 80 years as projected by the United Nations.
#get correct data
#Try examining reproduction_rate and fertility_rate throughout time for USA and
#the two counties with most different changes from the US's changes
#Following code aims to identify countries with the biggest changes in these two rates
fertility_reproduction_df <- medium_variant_df |>
filter(type == "Country/Area") |>
select(area, year, reproduction_rate, fertility_rate)
#find biggest changes
change_country_df <- fertility_reproduction_df |>
filter(year == 2024 | year == 2100) |>
mutate(
reproduction_rate = ifelse(year == 2024, -reproduction_rate, reproduction_rate),
fertility_rate = ifelse(year == 2024, -fertility_rate, fertility_rate)
) |>
group_by(area) |>
summarize(
difference_fertility_rate = sum(fertility_rate),
difference_reproduction_rate = sum(reproduction_rate),
.groups = "drop")
# Constants for US rates
us_dfr <- 0.025
us_drr <- 0.018
# Add distance columns and sort
change_country_df <- change_country_df |>
mutate(
difference_fertility_rate_dist = difference_fertility_rate - us_dfr,
difference_reproduction_rate_dist = difference_reproduction_rate - us_drr) |>
arrange(desc(difference_fertility_rate_dist+difference_reproduction_rate_dist))
#print(change_country_df)
#we see that we want 'Republic of Korea' and 'Central African Republic'
# and 'United States of America'
#Notice the amount of eastern Asian counties on page 1 (?)
#Remember that that Asia was declining in pop and Africa was increasing
#Why this might be initially counterintuitive but its becuase african countries
# are starting with high reproduction rates and firtility rates and it goes to normal (<0)
#graph these three countires throughout the timeframe
all_data <- medium_variant_df |>
filter(type == "Country/Area") |>
select(area, year, reproduction_rate, fertility_rate)
all_data<- all_data |> mutate(Region= what_region(area))
# Define the highlight countries
highlight_countries <- c("Republic of Korea",
"Central African Republic",
"United States of America")
# Filter for just the highlighted countries
highlight_data <- all_data |>
filter(area %in% highlight_countries)
# Plot Reproduction Rate
ggplot() +
# All countries in gray at semi-transparency
geom_line(data = all_data, aes(x = year, y = reproduction_rate, group = area),
color = "gray", alpha = 0.5) +
# Highlighted countries in color
geom_line(data = highlight_data,
aes(x = year, y = reproduction_rate, color = area), linewidth = 1) +
geom_smooth(data = highlight_data,
aes(x = year, y = reproduction_rate, color = area), se = FALSE,
method = "loess") +
# Format axes
scale_y_continuous(labels = scales::label_number()) +
scale_x_continuous(
breaks = c(2024, 2040, 2060, 2080, 2100),
limits = c(2024, 2100)
) +
# Theme adjustments
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "gray"),
strip.text = element_text(size = 12, face = "bold"),
strip.text.x = element_text(margin = margin(b = 10)),
plot.margin = margin(20, 20, 20, 20)
) +
# Labels
labs(
title = "Reproduction Rate by Year",
x = "Year",
y = "Reproduction Rate (surviving daughters per woman)"
)
ggplot()+
geom_line(data = all_data, aes(x = year, y = fertility_rate, group = area),
color = "gray", alpha = 0.5) +
# Highlighted countries in color
geom_line(data = highlight_data, aes(x = year, y = fertility_rate,
color = area), linewidth = 1) +
geom_smooth(data = highlight_data,
aes(x = year, y = fertility_rate,color = area), se = FALSE,
method = "loess") +
# Format axes
scale_y_continuous(labels = scales::label_number()) +
scale_x_continuous(
breaks = c(2024, 2040, 2060, 2080, 2100),
limits = c(2024, 2100)
) +
# Theme adjustments
theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "gray"),
strip.text = element_text(size = 12, face = "bold"),
strip.text.x = element_text(margin = margin(b = 10)),
plot.margin = margin(20, 20, 20, 20)
) +
# Labels
labs(
title = "Fertility Rate by Year",
x = "Year",
y = "Fertility Rate (live births per woman)"
)
The graphs show that fertility and reproduction rates are closely linked and declining (<1 for reproduction rate, <2 for fertility rate) across all regions. The U.S. declines steadily, the Central African Republic drops sharply from a high starting point, and Korea starts low and increases to a more steady decline. These patterns align with earlier trends in population growth and suggest a global move toward lower birth rates.
In investigation of gender influences on population growth rate, we first start by visualizing the United Nation’s projected gender breakdown over time. We will try to identify sub-regions that had notable changes in their birth rates, as explored above, and view their gender rate trend; the hopes being that we see similar grouping of trends in the same countries groups we saw in the last figure.
#Heatmap that y is subregions, x is time in 20 year chunks, and the value is the
# gender breakdown (dark blue, blue, light blue, light pink, pink, dark pink)
#facet wrap on Region
#get data i want
gender_heatmap_df<- medium_variant_df |>
select(area, type, year, sex_ratio_july) |>
filter(type %in% c("Region", "Subregion") |
(area %in% c("United States of America", "Canada"))) |>
mutate(type = ifelse(type == "Country/Area", "Subregion", type))
#add region column
gender_heatmap_sub <- gender_heatmap_df |> filter(type== "Subregion") |>
mutate(Region = what_region(area), Subregion = area) |>
select(year, Region, Subregion, sex_ratio_july)
#make year buckets
gender_heatmap_sub <- gender_heatmap_sub |>
mutate(year_chunk = floor((year-.0001)/20)*20)
#Helper function for graphing
#function to pick which value over 20 year chunk to pick to show
chunk_gender <- function(Subr, yer) {
target_year <- (floor((yer-.0001)/20)*20) + 19
val <- gender_heatmap_sub |>
filter(year == target_year & Subregion == Subr) |>
pull(sex_ratio_july)
return(toString(round(val)))
}
gender_heatmap_sub_with_text <- gender_heatmap_sub |>
rowwise() |>
mutate(label_text = chunk_gender(Subregion, year)) |>
ungroup()
ggplot(gender_heatmap_sub_with_text, aes(x = factor(year_chunk),
y = Subregion, fill = sex_ratio_july))+
geom_tile() +
geom_text(aes(label = label_text), color = "black", size = 7) +
facet_wrap(~ Region, scales = "free_y") +
scale_fill_gradientn(colors = c("steelblue4", "steelblue3", "steelblue1",
"plum1", "plum3", "plum4")) +
theme_minimal() +
labs(
title = "Sex Ratio Over Time by Subregion",
x = "20-Year Time Chunks",
y = "Subregion",
fill = "Sex Ratio (July)",
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
strip.text = element_text(face = "bold"),
aspect.ratio = 1
)
The heatmap shows most regions moving toward a more balanced
male-to-female ratio. In regions like Eastern Asia, where the ratio
shifts from female-heavy to male-heavy, population growth has slowed
significantly. In contrast, many African sub-regions show a growing
female population, which supports their continued population growth.
Oceania’s steady gender balance matches its stable growth rates,
reinforcing how gender distribution may influence or reflect broader
demographic trends.
To explore how social and environmental factors relate to population trends, this graph compares mean birthing age and population density across subregions over time. Understanding how age at childbirth shifts alongside population clustering may reveal how reproductive behaviors influence or respond to regional growth pressures.
#plot with slider on year. x is different subregions. The y for 1 graph
# is mean_birthing_age, and for the other is pop_dens_july
# Prepare the data for both mean_birthing_age and pop_dens_july
combined_data <- medium_variant_df |>
select(area, type, year, mean_birthing_age, pop_dens_july) |>
filter(type %in% c("Region", "Subregion") |
(area %in% c("United States of America", "Canada"))) |>
mutate(type = ifelse(type == "Country/Area", "Subregion", type)) |>
filter(type == "Subregion") |>
mutate(Region = what_region(area), Subregion = area) |>
arrange(Region, Subregion)
# Ensure Subregion order is fixed for plotting
combined_data$Subregion <- factor(combined_data$Subregion,
levels = unique(combined_data$Subregion))
#plot
plot_ly(data = combined_data,
x = ~Subregion,
y = ~mean_birthing_age,
frame = ~year,
type = "bar",
name = "Mean Birthing Age",
text = ~paste(Subregion, "<br>Mean Birthing Age:", round(mean_birthing_age, 2)),
hoverinfo = "text") |>
add_trace(y = ~pop_dens_july,
name = "Population Density",
yaxis = "y2",
text = ~paste(Subregion, "<br>Population Density:", round(pop_dens_july, 2)),
hoverinfo = "text") |>
layout(
title = "Mean Birthing Age and Population Density by Subregion Over Time",
xaxis = list(
title = "Subregion",
tickangle = -45,
showgrid = FALSE
),
yaxis = list(
title = "Mean Birthing Age",
range = c(25,35)
),
yaxis2 = list(
title = "Population Density",
overlaying = "y",
side = "right",
range = c(0,410)
),
barmode = "group",
updatemenus = list(
list(
type = "buttons",
showactive = FALSE,
buttons = list(
list(label = "Play",
method = "animate",
args = list(NULL,
list(mode = "immediate",
frame = list(duration = 500, redraw = FALSE),
transition = list(duration = 0)))),
list(label = "Pause",
method = "animate",
args = list(NULL,
list(mode = "immediate",
frame = list(duration = 0, redraw = FALSE),
transition = list(duration = 0))))
)
)
)
) |>
animation_opts(
frame = 500,
transition = 0,
easing = "linear"
) |>
config(displaylogo = FALSE)
In most subregions, birthing age gradually increases, while population density follows varied trends. In Asia and Africa, lower birthing ages appear linked to higher future population density, hinting at a relationship between reproductive behavior and regional growth.
Training Neural Network for Predictions
For the last part of the investigation, I focused on building a Neural Network to predict the future population trend as the United Nations did in their data report. To do this, I used forward selection to identify the most impactful demographic features, such as fertility rate, life expectancy, and net migration. These variables were then used to train a Neural Network model to simulate future population changes. The aim was to evaluate whether a machine learning approach could closely mirror the UN’s projections using only available demographic data.
#Im going to do my model a bit differnt here. I am going to do a forward AIC
# to try to arrive at a good model
library(leaps)
lm_df <- estimates_df |> select(-ISO2, -ISO3, -type, -Location_id, -area,
- id, -parent_id)
#lm_df|> summarise(across(everything(), ~ sum(is.na(.)), .names = "NA_count_{col}"))
#looked into na values and I will probably just take out SDMX and then take
#out the NA tuples in pop_doubl_time
lm_df <- lm_df |> filter(!is.na(pop_doubl_time)) |> select(-SDMX)
#make population one column
lm_df <- lm_df |> pivot_longer(cols = c(Jan_pop, July_pop),
names_to = "pop_type", values_to = "Population")|>
mutate(year = if_else(pop_type == "July_pop", year + 0.5, year) )|>
select(-pop_type, -female_july_pop, -male_july_pop)
#Perform backwards step BIC on predicting (commented out for time)
# lm.empty = lm(Population~1, dat=lm_df)
# lm.full = lm(Population~., dat = lm_df)
# lm.forward.AIC = step(lm.empty, scope = list(lower = lm.empty, upper = lm.full), direction = "forward")
# summary(lm.forward.AIC)$r.squared
# length(lm.forward.AIC$coef)-1
FAIC_model_text <- 'Population ~ live_births + births + male_deaths + birth_sex_ratio +
nat_change + teen_births + under5_deaths + pop_change + under5_mortality_rate +
male_newborn_life_expectancy + age15_life_expectancy + mean_birthing_age +
death_rate + age65_life_expectancy + between15_60_mortality_rate +
male_under40_mortality_rate + male_age80_life_expectancy +
infant_mortality_rate + male_under60_mortality_rate + num_migrants +
male_between15_60_mortality_rate + male_between15_50_mortality_rate +
between15_50_mortality_rate + age80_life_expectancy + under60_mortality_rate +
under40_mortality_rate + female_age80_life_expectancy + year +
female_newborn_life_expectancy + female_between15_50_mortality_rate +
female_under60_mortality_rate + female_between15_60_mortality_rate +
sex_ratio_july + female_age15_life_expectancy + male_age15_life_expectancy +
male_age65_life_expectancy + female_age65_life_expectancy +
newborn_life_expectancy + pop_doubl_time + migration_rate +
reproduction_rate + rate_nat_change + female_under40_mortality_rate +
infant_deaths + med_age_july'
FAIC_model <- lm(FAIC_model_text, dat = lm_df)
#summary(FAIC_model)
#check the model on the variant data
pred_df<- medium_variant_df|> select(-fertility_rate, -female_deaths,
-pop_grow_rate, -pop_dens_july, -ISO2,
-ISO3, -type, -Location_id, -area, - id,
-parent_id, -female_july_pop,
-male_july_pop, -SDMX)
#pred_df |> summarise(across(everything(), ~ sum(is.na(.)), .names = "NA_count_{col}"))
pred_df <- pred_df |> filter(!is.na(pop_doubl_time))
pred_df <- pred_df |>
pivot_longer(cols = c(Jan_pop, July_pop),
names_to = "pop_type", values_to = "Population") |>
mutate(year = if_else(pop_type == "July_pop", year + 0.5, year) )|>
select(-pop_type)
preds <-predict(FAIC_model, pred_df)
pred_comp_df <- pred_df |> mutate(predicted = preds)
# Now you can compare predicted to actual:
# For example, compute mean absolute error
mae <- mean(abs(pred_comp_df$predicted - pred_comp_df$Population))
# Or simply look at correlation
corr <- cor(pred_comp_df$predicted, pred_comp_df$Population)
# View the first few rows
#head(pred_comp_df)
# If desired, visualize predictions vs. actuals
library(ggplot2)
ggplot(pred_comp_df, aes(x = Population, y = predicted)) +
geom_point() +
geom_abline(color = "red") +
labs(title = "Predicted vs Actual Population",
x = "Actual Population",
y = "Predicted Population")
The graph compares the real population numbers from the dataset (actual) with the numbers predicted by the Neural Network model. If the model is accurate, the two lines should follow a similar path. In this case, the predicted values closely match the actual ones, especially in recent years. This shows that the model was successful in learning key patterns and can reasonably estimate future population trends.
Conclusions
In summary, exploratory data analysis serves as a foundational, iterative process for uncovering complex demographic patterns and relationships. This investigation has shed light on the multifaceted nature of global population dynamics, illustrating how regional and sub-regional trajectories vary in their growth, fertility, gender ratios, and social factors like birthing age. The interplay of these variables emphasizes that population change is neither uniform nor easily predicted by a single factor. Through visualizations and modeling, the study has highlighted how areas such as Africa continue to drive global growth despite general slowdowns, while regions like Eastern Asia face pronounced declines linked to shifting reproductive behaviors. The use of predictive modeling, including a Neural Network trained on demographic features, demonstrates the potential for data-driven approaches to approximate complex population trends, with the main constraint being the amount of available data.